The R Script associated with this page is available here. Download this file and open it (or copy-paste into a new script) with RStudio so you can follow along.

1 RStudio Shortcuts

1.1 Running code

  • ctrl-R (or command-R) to run current line
  • Highlight code in script and run ctrl-R (or command-R) to run selection
  • Buttons:

1.2 Switching windows

  • ctrl-1: script window
  • ctrl-2: console window

Try to run today’s script without using your mouse/trackpad

2 Data wrangling

2.1 Useful packages: dplyr and tidyr

Cheat sheets on website for Data Wrangling

library(dplyr)
library(tidyr)

Remember use install.packages("dplyr") to install a new package.

2.1.1 Example operations from here

2.2 New York City Flights

Data from US Bureau of Transportation Statistics (see ?nycflights13)

library(nycflights13)

Check out the flights object

head(flights)

2.2.1 Object Structure

Check out data structure with glimpse()

glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5,…
## $ arr_time       <int> 830, 850, 923, 1004, 812…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12…
## $ carrier        <chr> "UA", "UA", "AA", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 4…
## $ tailnum        <chr> "N14228", "N24211", "N61…
## $ origin         <chr> "EWR", "LGA", "JFK", "JF…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQ…
## $ air_time       <dbl> 227, 227, 160, 183, 116,…
## $ distance       <dbl> 1400, 1416, 1089, 1576, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 20…

3 dplyr “verbs”

  • select() and rename(): Extract existing variables
  • filter() and slice(): Extract existing observations
  • arrange()
  • distinct()
  • mutate() and transmute(): Derive new variables
  • summarise(): Change the unit of analysis
  • sample_n() and sample_frac()

3.1 Useful select functions

  • -” Select everything but
  • :” Select range
  • contains() Select columns whose name contains a character string
  • ends_with() Select columns whose name ends with a string
  • everything() Select every column
  • matches() Select columns whose name matches a regular expression
  • num_range() Select columns named x1, x2, x3, x4, x5
  • one_of() Select columns whose names are in a group of names
  • starts_with() Select columns whose name starts with a character string

3.1.1 select() examples

Select only the year, month, and day columns:

select(flights,year, month, day)

3.1.2 select() examples

Select everything except the tailnum:

select(flights,-tailnum)

Select all columns containing the string "time":

select(flights,contains("time"))

You can also rename columns with select()

select(flights,year,carrier,destination=dest)

3.2 filter() observations

Filter all flights that departed on on January 1st:

filter(flights, month == 1, day == 1)

3.3 Base R method

This is equivalent to the more verbose code in base R:

flights[flights$month == 1 & flights$day == 1, ]

Compare with dplyr method:

filter(flights, month == 1, day == 1)`

Filter the flights data set to keep only evening flights (dep_time after 1600) in June.

filter(flights,dep_time>1600,month==6)

3.4 Other boolean expressions

filter() is similar to subset() except it handles any number of filtering conditions joined together with &.

You can also use other boolean operators, such as OR (“|”):

filter(flights, month == 1 | month == 2)

Filter the flights data set to keep only ‘redeye’ flights where the departure time (dep_time) is “after” the arrival time (arr_time), indicating it arrived the next day:


filter(flights,dep_time>arr_time)

3.5 Select rows with slice():

slice(flights, 1:10)

3.6 arrange() rows

arrange() is similar to filter() except it reorders instead of filtering.

arrange(flights, year, month, day)

Base R method:

flights[order(flights$year, flights$month, flights$day), ]

3.7 Descending order: desc()

arrange(flights, desc(arr_delay))

Base R method:

flights[order(desc(flights$arr_delay)), ]

3.8 Distinct: Find distinct rows

distinct(
  select(flights,carrier)
)

3.9 Mutate: Derive new variables

Adds columns with calculations based on other columns.

Average air speed (miles/hour):

select(
  mutate(flights,ave_speed=distance/(air_time/60)),
  distance, air_time,ave_speed)

3.10 Chaining Operations

Performing multiple operations sequentially with a pipe character

  1. Group by a variable
  2. Select some columns
  3. Summarize observations
  4. Filter by results

With temporary objects:

a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
a3 <- summarise(a2,
                arr = mean(arr_delay, na.rm = TRUE),
                dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)
head(a4)

If you don’t want to save the intermediate results: wrap the function calls inside each other:

filter(
  summarise(
    select(
      group_by(flights, year, month, day),
      arr_delay, dep_delay
    ),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
## Adding missing grouping variables: `year`, `month`, `day`

Arguments are distant from function -> difficult to read!

3.11 Chaining Operations

%>% allows you to pipe together various commands

x %>% f(y) turns into f(x, y)

So you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom:

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`

3.12 Analyze by group with group_by()

Perform operations by group: mean departure delay by airport (origin)

flights %>%
  group_by(origin) %>%
  summarise(meanDelay = mean(dep_delay,na.rm=T))

Perform operations by group: mean and sd departure delay by airline (carrier)

flights %>% 
  group_by(carrier) %>%  
  summarise(meanDelay = mean(dep_delay,na.rm=T),
            sdDelay =   sd(dep_delay,na.rm=T))

Flights from which origin airport go the farthest (on average)? Hint: Group by airport (origin) then calculate the maximum flight distance (distance).

flights %>% 
  group_by(origin) %>%  
  summarise(meanDist = mean(distance,na.rm=T))

Which destination airport (dest) is the farthest (distance) from NYC?

flights %>% 
  arrange(desc(distance)) %>% 
  select(dest,distance) %>% 
  slice(1)

Which airport is that?

4 Combining data sets

4.1 dplyr join methods

Drawing

  • left_join(a, b, by = "x1") Join matching rows from b to a.
  • right_join(a, b, by = "x1") Join matching rows from a to b.
  • inner_join(a, b, by = "x1") Retain only rows in both sets.
  • full_join(a, b, by = "x1") Join data. Retain all values, all rows.

4.1.1 Left Join

left_join(a, b, by = "x1") Join matching rows from b to a.

Drawing Drawing

4.1.2 Right Join

right_join(a, b, by = "x1") Join matching rows from a to b.

Drawing Drawing

4.1.3 Inner Join

inner_join(a, b, by = "x1") Retain only rows in both sets.

Drawing Drawing

4.1.4 Full Join

full_join(a, b, by = "x1") Join data. Retain all values, all rows.

Drawing Drawing

flights%>%
  select(-year,-month,-day,-hour,-minute,-dep_time,-dep_delay)%>%
  glimpse()
## Observations: 336,776
## Variables: 12
## $ sched_dep_time <int> 515, 529, 540, 545, 600,…
## $ arr_time       <int> 830, 850, 923, 1004, 812…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12…
## $ carrier        <chr> "UA", "UA", "AA", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 4…
## $ tailnum        <chr> "N14228", "N24211", "N61…
## $ origin         <chr> "EWR", "LGA", "JFK", "JF…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQ…
## $ air_time       <dbl> 227, 227, 160, 183, 116,…
## $ distance       <dbl> 1400, 1416, 1089, 1576, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 20…

Let’s look at the airports data table (?airports for documentation):

glimpse(airports)
## Observations: 1,458
## Variables: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J"…
## $ name  <chr> "Lansdowne Airport", "Moton Field…
## $ lat   <dbl> 41.13047, 32.46057, 41.98934, 41.…
## $ lon   <dbl> -80.61958, -85.68003, -88.10124, …
## $ alt   <int> 1044, 264, 801, 523, 11, 1593, 73…
## $ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -…
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A"…
## $ tzone <chr> "America/New_York", "America/Chic…

What is the name of the destination airport farthest from the NYC airports? Hints:

  • Use a join to connect the flights dataset and airports dataset.
  • Figure out which column connects the two tables.
  • You may need to rename the column names before joining.


select(airports,
       dest=faa,
       destName=name)%>%
  right_join(flights)%>% 
  arrange(desc(distance)) %>% 
  slice(1) %>% 
  select(destName)
## Joining, by = "dest"

4.2 Plot the flights data

The section below shows some ‘advanced’ coding to extract the geographic locations for all flights and plotting. This is just meant as an example to illustrate how one might use these functions to perform a mini-analysis that results in a map.

4.2.1 Join destination airports

library(geosphere)
library(maps)
library(ggplot2)
library(sp)
library(rgeos)
## rgeos version: 0.4-2, (SVN revision 581)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 
##  Linking to sp version: 1.3-1 
##  Polygon checking: TRUE
data=
  select(airports,
         dest=faa,
         destName=name,
         destLat=lat,
         destLon=lon)%>%
  right_join(flights)%>%
  group_by(dest,
           destLon,
           destLat,
           distance)%>%
  summarise(count=n())%>%
  ungroup()%>%
  select(destLon,
         destLat,
         count,
         distance)%>%
  mutate(id=row_number())%>%
  na.omit()
## Joining, by = "dest"
NYCll=airports%>%filter(faa=="JFK")%>%select(lon,lat)  # get NYC coordinates

# calculate great circle routes
rts <- gcIntermediate(as.matrix(NYCll),
                      as.matrix(select(data,destLon,destLat)),
                      1000,
                      addStartEnd=TRUE,
                      sp=TRUE)
rts.ff <- fortify(
  as(rts,"SpatialLinesDataFrame")) # convert into something ggplot can plot

## join with count of flights
rts.ff$id=as.integer(rts.ff$id)
gcircles <- left_join(rts.ff,
                      data,
                      by="id") # join attributes, we keep them all, just in case

Now build a basemap using data in the maps package.

base = ggplot()
worldmap <- map_data("world",
                     ylim = c(10, 70),
                     xlim = c(-160, -80))
wrld <- c(geom_polygon(
  aes(long, lat, group = group),
  size = 0.1,
  colour = "grey",
  fill = "grey",
  alpha = 1,
  data = worldmap
))

Now draw the map using ggplot

base + wrld +
  geom_path(
    data = gcircles,
    aes(
      long,
      lat,
      col = count,
      group = group,
      order = as.factor(distance)
    ),
    alpha = 0.5,
    lineend = "round",
    lwd = 1
  ) +
  coord_equal() +
  scale_colour_gradientn(colours = c("blue", "orange", "red"),
                         guide = "colourbar") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white')) +
  labs(y = "Latitude", x = "Longitude",
       title = "Count of Flights from New York in 2013")
## Warning: Ignoring unknown aesthetics: order

4.3 Colophon

This tutorial has been forked from awesome classes developed by Adam Wilson here: http://adamwilson.us/RDataScience/

This exercise based on code from here.